Famous Men and Women on Wikipedia

How do the Wikipedia pages of famous men differ from those of famous women?

Background

The initial interest in this project stemmed from an Atlantic article about sexism on Wikipedia. The statistics were staggering. Fewer than 10 percent of Wikipedia editors are women, and few of these editors are experienced (having made at least 500 edits). The lack of women editors leads to a lack of pages for notable women. For example, there are at least 4,400 female scientists who meet Wikipedia's standards of notability but have not had pages created for them. Even when pages for notable women are created, studies show that such pages are more likely than the pages of notable men to mention their gender and relationships. In this project, we wanted to explore further the differences between the Wikipedia pages of famous men and women by looking closely at the trends surrounding the Wikipedia pages of the 100 most famous men and women.

Introduction

The main question that we set out to answer in this project was the following: How do the Wikipedia pages of famous women differ from those of famous men?

To answer this question, we first generated a list of the top 5 'most famous' women and men using Google's PageRank. We chose PageRank as the metric by which to define fame because we wanted a source extrinsic to Wikipedia, to avoid biases that Wikipedia's own metrics may have (although such biases are of course also possible in Google's algorithm). The top 10 most famous men and women based on this metric are listed below:

Rank  Women                   Men
1     Elizabeth II            Napoleon
2     Queen Victoria          Barack Obama
3     Mary (mother of Jesus)  George W. Bush
4     Elizabeth I             William Shakespeare
5     Margaret Thatcher       Jesus
6     Madonna (entertainer)   Adolf Hitler
7     Hillary Clinton         Franklin D. Roosevelt
8     Catherine the Great     Aristotle
9     Beyonce                 Bill Clinton
10    Britney Spears          Ronald Reagan

We were interested not in the specific content of the pages but rather in features of these pages and in user interaction with them. With this in mind, we chose to investigate this question by looking at the following attributes:

  • The number of backlinks (links to a given Wikipedia page from other Wikipedia pages)
  • The number of revisions per page
  • The size of revisions to these pages
  • The number of unique editors per page
  • The amount of text per page
  • The language used on these pages (main pages, as well as talk pages)
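
As a rough illustration of how the numeric attributes above could be computed for one page, here is a minimal sketch. The record shape (a dict with `user`, `size`, and `timestamp` keys) and the function name `page_metrics` are assumptions made for this example, not the format or code used in the project, and revision size is taken here to mean the absolute change in page size between consecutive revisions. The language analysis is not covered.

```python
from typing import Dict, List

# Hypothetical revision record: the editor's user name, the page size in
# bytes after the revision, and an ISO 8601 timestamp. This shape is an
# assumption for illustration only.
Revision = Dict[str, object]


def page_metrics(revisions: List[Revision], text: str,
                 backlinks: List[str]) -> Dict[str, float]:
    """Summarize one page using the numeric attributes listed above."""
    # Sort oldest-first so that size changes follow the edit history.
    # ISO 8601 timestamps in the same format sort correctly as strings.
    ordered = sorted(revisions, key=lambda r: str(r["timestamp"]))
    sizes = [int(r["size"]) for r in ordered]
    # Size of a revision = absolute change in page size relative to the
    # previous revision (the first revision counts in full).
    deltas = [abs(cur - prev) for prev, cur in zip([0] + sizes[:-1], sizes)]
    return {
        "backlinks": len(set(backlinks)),
        "revisions": len(ordered),
        "mean_revision_size": sum(deltas) / len(deltas) if deltas else 0.0,
        "unique_editors": len({r["user"] for r in ordered}),
        "words": len(text.split()),
    }


# Made-up example data, not taken from any real page.
example_revisions = [
    {"user": "A", "size": 100, "timestamp": "2017-01-01T00:00:00Z"},
    {"user": "B", "size": 150, "timestamp": "2017-01-02T00:00:00Z"},
    {"user": "A", "size": 120, "timestamp": "2017-01-03T00:00:00Z"},
]
example = page_metrics(example_revisions, "one two three four", ["X", "Y", "X"])
```

Counting unique editors separately from revisions is what makes it possible to tell whether a high revision count comes from many editors or from a few very active ones.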

The top 5 most famous men and women on Wikipedia

We first decided to look at the Wikipedia pages of only the top 5 men and women.

One attribute we looked at was backlinks (the number of links to a given Wikipedia page from other Wikipedia pages). We intended to use this feature as one metric of a page's popularity and connectedness to other pages. Results from the top 5 pages (for men and women) show that in general (with the exception of the most popular man and woman), the pages of famous men have more backlinks.

We next looked at the number of revisions per page (over the lifetime of the page). The results show a trend similar to that of the backlinks: the number of revisions to the Wikipedia page of a famous man is greater than the number of revisions to the Wikipedia page of a famous woman of equal 'fame' (PageRank).

We then looked at the number of unique editors per page (over the lifetime of the page). We wanted to see whether the observed discrepancy in the number of revisions could be explained by a small number of editors making multiple edits. However, when we plotted the number of unique editors, we saw the same trend that we observed for revisions: a greater number of unique editors for the pages of men with a given rank than for the pages of women with equal rank. Thus, it seems that there are simply more edits and more editors for these men's pages.

Finally, we looked at the amount of text (number of words) per page. The results generally show that the length of text on the Wikipedia page of a famous man is greater than the length of text on the Wikipedia page of a famous woman of equal 'fame' (PageRank).

Analysis for the top 100 most famous men and women

Intrigued by our results from the top 5 most famous men and women, we decided to look at the same features for a larger population (100 men and 100 women) to see whether the trends we observed generalize. One reason for repeating the analysis with more people was that the top 5 most 'famous' men and women (according to PageRank) fell into rather specific categories, namely British royalty and politicians (Elizabeth II, Queen Victoria, Elizabeth I, Margaret Thatcher) and religious figures (Jesus, Mary), which were not representative of the full top 100 list. To determine whether the results we observed reflected differences between the pages of famous men and women on Wikipedia, or were instead biased by other attributes of these 10 people, we expanded our analysis to 100 men and 100 women, who represented a more diverse group. You can see the top 100 men and women broken down into broad categories below:

The distribution of these categories among men and women is also interesting, and it reveals something about the types of professions that make men and women famous. For example, while the most common categories for famous men were Political (Historical, pre-1900) (31%) and Political (Current, post-1900) (28%), the most common category for famous women was Artists and Celebrities (Current, post-1900) (51%).
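
The category percentages quoted above are simple shares of a labelled list. A minimal sketch of that calculation follows; the function name `category_shares` and the placeholder labels are hypothetical, and the example list is made up rather than taken from the project's data.

```python
from collections import Counter
from typing import Dict, List


def category_shares(categories: List[str]) -> Dict[str, float]:
    """Return each category's share of the list, as a percentage."""
    counts = Counter(categories)
    total = len(categories)
    # most_common() orders categories from most to least frequent.
    return {name: 100.0 * n / total for name, n in counts.most_common()}


# Made-up labels for four people, one label per person.
shares = category_shares(["Artists", "Artists", "Political", "Religious"])
```

Running this once on the men's labels and once on the women's gives the two distributions being compared.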

Number of Backlinks

The mean number of backlinks to the Wikipedia pages of famous men is significantly greater than the mean number of backlinks to the Wikipedia pages of famous women (two-tailed t-test: t=-8.27, p<0.001).

Number of Revisions

The mean number of revisions to the Wikipedia pages of famous men is significantly greater than the mean number of revisions to the Wikipedia pages of famous women (two-tailed t-test: t=-3.93, p=0.0001).

Number of Unique Editors

The mean number of unique editors of the Wikipedia pages of famous men is significantly greater than the mean number of unique editors of the Wikipedia pages of famous women (two-tailed t-test: t=-4.77, p<0.0001).

Amount of Text (Number of Words) per Page

The mean number of words/page for the Wikipedia pages of famous men is significantly greater than the mean number of words/page for the Wikipedia pages of famous women (two-tailed t-test: t=-4.39, p<0.0001).
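
For readers who want to see what sits behind the t values reported in this section, here is a minimal sketch of a two-sample t statistic. The write-up does not say which variant of the t-test was used, so the choice of Welch's unequal-variance form is an assumption, as are the function name `welch_t` and the convention of subtracting the men's mean from the women's mean (which would give negative t values when the men's mean is larger, matching the signs reported here).

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(women: Sequence[float], men: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom).

    Uses Welch's unequal-variance form, which is an assumption; each
    sample needs at least two values with non-zero spread overall.
    """
    n_w, n_m = len(women), len(men)
    # Squared standard error of each sample mean (sample variance / n).
    se2_w = variance(women) / n_w
    se2_m = variance(men) / n_m
    t = (mean(women) - mean(men)) / math.sqrt(se2_w + se2_m)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_w + se2_m) ** 2 / (
        se2_w ** 2 / (n_w - 1) + se2_m ** 2 / (n_m - 1)
    )
    return t, df


# Made-up numbers, not the project's data.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
```

Turning t and the degrees of freedom into a two-tailed p-value requires the t distribution, which the Python standard library does not provide; in practice a call such as SciPy's `scipy.stats.ttest_ind(women, men, equal_var=False)` would return both the statistic and the p-value.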

Analysis for the Time 'Most Influential People' of 2017

One issue that we identified in choosing the most famous men and women based on PageRank was the diversity of the people represented. For example, comparing one of the most famous men, such as Barack Obama (the number 1 person in the world for the past 8 years), with one of the most famous women, such as a Queen of England, may not be a fair comparison: any differences may reflect other characteristics of these figures rather than differences between famous men and women. To get a more even dataset, we repeated the analysis above for Time Magazine's list of the 100 'Most Influential People' of 2017. These are all contemporary figures, which makes them more comparable (outliers like Donald Trump were removed from this analysis). We also broke this list down according to the categories that Time defines (Pioneers, Artists, Leaders, Titans, and Icons) and compared the Wikipedia pages of men and women within these categories.

The results of this analysis were quite different from those for the famous men and women defined on the basis of PageRank. Comparing women and men across all categories, there was no significant difference in the number of backlinks (two-tailed t-test: t=-1.66, p=0.102), revisions (two-tailed t-test: t=-1.61, p=0.111), or editors (two-tailed t-test: t=-1.45, p=0.15) between the pages of men and women. There was a marginally significant difference in the number of words/page for the pages of these men and women, as shown in the graph below (two-tailed t-test: t=-1.99, p=0.048).

Below you can see the results for the same metrics (number of backlinks, number of revisions, number of unique editors, number of words/page) repeated for the 2017 Time list and separated by category. None of these inter-group differences were statistically significant.

Though this is a smaller dataset than the one analyzed for the PageRank data, this analysis suggests that the differences observed previously may not reflect differences in the Wikipedia pages of modern-day famous men and women, but may instead be a consequence of the different kinds of opportunities for fame that were historically available to men and women. For example, because there have been more famous male political figures than female political figures, and because the Wikipedia pages of famous political figures tend to be popular and extensive, differences may appear in the averaged metrics for the Wikipedia pages of men and women. More research using larger and more controlled datasets of famous men and women will be needed to resolve this issue, but the analysis done here suggests that the interplay of fame, gender, and profession, and their influence on the Wikipedia pages of individual people, is more complex than originally thought.
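
The within-category comparison described in this section amounts to grouping pages by category and gender before comparing them. The following is a minimal sketch of that grouping step only; the record shape `(category, gender, value)`, the function name `group_means`, and the example values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (category, gender, metric value for one page).
Record = Tuple[str, str, float]


def group_means(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Return the mean metric value per category and gender."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for category, gender, value in records:
        grouped[category][gender].append(value)
    return {
        category: {gender: mean(values) for gender, values in by_gender.items()}
        for category, by_gender in grouped.items()
    }


# Made-up values, not the project's data.
means = group_means([
    ("Artists", "F", 10.0),
    ("Artists", "F", 20.0),
    ("Artists", "M", 30.0),
    ("Leaders", "M", 5.0),
])
```

Comparing men and women inside each category, rather than across the whole list, keeps a category with unusually large pages from driving the overall difference.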

Conclusions

From this analysis, we were able to draw the following conclusions:
  1. The mean number of backlinks to the Wikipedia pages of the top 100 famous men (defined by PageRank) is significantly greater than the mean number of backlinks to the Wikipedia pages of the top 100 famous women (p<0.001)
  2. The mean number of revisions to the Wikipedia pages of the top 100 famous men (defined by PageRank) is significantly greater than the mean number of revisions to the Wikipedia pages of the top 100 famous women (p=0.0001)
  3. The mean number of unique editors on the Wikipedia pages of the top 100 famous men (defined by PageRank) is significantly greater than the mean number of unique editors on the Wikipedia pages of the top 100 famous women (p<0.0001)
  4. The mean number of words/page for the Wikipedia pages of the top 100 famous men (defined by PageRank) is significantly greater than the mean number of words/page for the Wikipedia pages of the top 100 famous women (p<0.0001)
  5. The categories to which the 100 most famous men and women belong are very different. The largest category of famous men was historical (pre-1900) political figures (31%), while the largest category of famous women was current (post-1900) artists and celebrities (51%)
  6. These results do not hold when the analysis is repeated for a more balanced, modern-day dataset of famous men and women (Time Magazine's Most Influential People of 2017 list)

Future Directions

Some potential future directions for this work include:

  • Look at an expanded set of famous people
  • Repeat the analysis using a different metric to define the most famous men and women (such as page views or harmonic centrality)
  • Do within-category comparisons of famous women and men (e.g. comparing current famous female athletes and current famous male athletes) to eliminate the possible effect of category on our metrics
  • Compare the amount of vandalism between the pages of famous men and women
  • Analyze the talk pages for these top 100 women and men, including the number of comments on these talk pages, the number of users on these talk pages, and the nature of the interactions (contentious or civil)
  • Complete a more in-depth analysis of the language used on these pages, over an increased sample size, to see if there are certain words or phrases more commonly used on the main or talk pages of famous women compared with those of famous men
  • Use textual analysis, such as sentiment analysis, to examine features of the text of these pages. Can we predict whether a page belongs to a woman or a man based on the language used?
  • See if the trends that we observed for the 100 most famous men and women hold for men and women who are less 'famous' on Wikipedia

Code

You can view the code used to do this data analysis (and some other analyses not described above) through the notebooks linked below: